Example of using ptm_pred to prototype phosphorylation classifiers

Histidine phosphorylation is a quick place to start. There is not much data, but that means the code runs much faster.

Predictor is the class that handles reading the data. sequence_vector is a function that vectorizes a protein sequence into a feature array, representing amino acids as integer values from 0 to 20, where 0 represents empty space used to even out vector length. It can also include hydrophobicity as a feature.
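To make the integer encoding concrete, here is a minimal sketch of the idea. It is not the sequence_vector function from pred: the name toy_sequence_vector, the alphabet order, and the fixed length argument are all assumptions made for illustration, and the hydrophobicity feature is left out.

```python
# Hypothetical alphabet order; the mapping used by pred.sequence_vector may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
# Codes 1..20 for amino acids; 0 is reserved for empty space (padding).
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def toy_sequence_vector(sequence, length):
    """Encode a sequence as integers in 0..20, padded with 0 to a fixed length."""
    # Sequences longer than `length` are truncated; unknown symbols map to 0.
    codes = [AA_TO_INT.get(aa, 0) for aa in sequence[:length]]
    return codes + [0] * (length - len(codes))


print(toy_sequence_vector("MKYV", 6))  # [11, 9, 20, 18, 0, 0]
```

Padding with 0 gives every sequence a vector of the same length, which is what a classifier expecting fixed-size input needs.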


In [1]:
from pred import Predictor
from pred import sequence_vector

Next we are going to load our data and generate random negative data, i.e. gibberish data. The clean data files have negatives created from the data sets pulled from Phospho.ELM and dbPTM.

In generate_random_data, the amino acid parameter is the amino acid being modified (the target of the modification), and the float passed in is a multiplier. For example, a multiplier of .5 means that .5 * number of data points random negatives are generated.
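The multiplier arithmetic can be sketched as follows. This is not the generate_random_data method itself: toy_random_negatives, the window size, and the choice to place the target amino acid at the centre of each random sequence are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_random_negatives(n_data_points, multiplier, amino_acid, window=7, seed=0):
    """Return int(multiplier * n_data_points) random sequences centred on amino_acid."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    count = int(multiplier * n_data_points)
    half = window // 2
    negatives = []
    for _ in range(count):
        # Random flanking residues on each side of the target amino acid.
        left = "".join(rng.choice(AMINO_ACIDS) for _ in range(half))
        right = "".join(rng.choice(AMINO_ACIDS) for _ in range(half))
        negatives.append(left + amino_acid + right)
    return negatives


negatives = toy_random_negatives(n_data_points=100, multiplier=0.5, amino_acid="Y")
print(len(negatives))  # 50
```

With 100 data points and a multiplier of .5, the sketch produces 50 random negatives; a multiplier of 0 produces none.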


In [2]:
y = Predictor()
y.load_data(file="Data/Training/clean_Y.csv")


Loading Data
Loaded Data

Next we vectorize the sequences using the sequence vector. We can also apply a data balancing function; here we are using ADASYN, which generates synthetic examples of the minority (in this case positive) class.


In [3]:
y.process_data(vector_function="sequence", amino_acid="Y", imbalance_function="ADASYN", random_data=0)


Applying Vector Function
Finished Applying Vector Function

Now we can train a classifier; here we are using "mlp_adam".

The output contains the precision, recall, F-score, and support for each class, along with the overall accuracy.
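For readers unfamiliar with these metrics, the sketch below shows how they are computed for a single positive class. The function name binary_report and the toy labels are made up for illustration; the notebook itself prints scikit-learn's classification_report, as the output further down shows.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(pairs),
    }


# Toy, made-up labels: 95 negatives, 5 positives, and a model that always says 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(binary_report(y_true, y_pred))
```

On imbalanced data the accuracy can look high even when the positive class is almost never found, so the per-class recall and F-score are the numbers to watch.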


In [4]:
y.supervised_training("mlp_adam")


Starting Training
12770 12770
/Users/mark/anaconda3/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:563: ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
  % (), ConvergenceWarning)
/Users/mark/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py:2515: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return bool(asarray(a1 == a2).all())
Done training
Test Results
             precision    recall  f1-score   support

    Non PTM       0.95      1.00      0.97      3656
        PTM       0.12      0.01      0.02       175

avg / total       0.92      0.95      0.93      3831

Accuracy 0.950926651005
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-56bae75c174d> in <module>()
      1 #y.balance_data("ncl")
----> 2 y.supervised_training("mlp_adam")

~/PycharmProjects/Post-Translational-Modification-Prediction/pred.py in supervised_training(self, classy, scale)
    424         print(classification_report(y_pred=self.test_results, y_true=self.y_test, target_names=["Non PTM", "PTM"]))
    425         print("Accuracy", accuracy_score(y_true=self.y_test, y_pred=self.test_results))
--> 426         print("ROC:", roc_auc_score(y_true=self.y_test, y_score=self.test_results))
    427 
    428     def benchmark(self, benchmark: str, aa: str):

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in roc_auc_score(y_true, y_score, average, sample_weight)
    258     return _average_binary_score(
    259         _binary_roc_auc_score, y_true, y_score, average,
--> 260         sample_weight=sample_weight)
    261 
    262 

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
     82 
     83     if y_type == "binary":
---> 84         return binary_metric(y_true, y_score, sample_weight=sample_weight)
     85 
     86     check_consistent_length(y_true, y_score, sample_weight)

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in _binary_roc_auc_score(y_true, y_score, sample_weight)
    253 
    254         fpr, tpr, tresholds = roc_curve(y_true, y_score,
--> 255                                         sample_weight=sample_weight)
    256         return auc(fpr, tpr, reorder=True)
    257 

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    503     """
    504     fps, tps, thresholds = _binary_clf_curve(
--> 505         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    506 
    507     # Attempt to drop thresholds corresponding to points in between and

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    312              array_equal(classes, [-1]) or
    313              array_equal(classes, [1]))):
--> 314         raise ValueError("Data is not binary and pos_label is not specified")
    315     elif pos_label is None:
    316         pos_label = 1.

ValueError: Data is not binary and pos_label is not specified

Next we can check against the benchmarks pulled from dbPTM.


In [ ]:
y.benchmark("Data/Benchmarks/phos.csv", "Y")

If you want to explore the data some more, you can easily generate PCA and t-SNE diagrams of the training set.


In [ ]:
y.generate_pca()

In [ ]:
y.generate_tsne()

There you have it: you have prototyped a tyrosine classifier.

